Clustering is difficult only when it does not matter

Amit Daniely, Nati Linial, Michael Saks

May 23, 2012 


Abstract

Numerous papers ask how difficult it is to cluster data. We suggest that the
more relevant and interesting question is how difficult it is to cluster data sets
that can be clustered well. More generally, despite the ubiquity and the great
importance of clustering, we still do not have a satisfactory mathematical theory
of clustering. In order to properly understand clustering, it is clearly necessary to
develop a solid theoretical basis for the area. For example, from the perspective
of computational complexity theory the clustering problem seems very hard.

Numerous papers introduce various criteria and numerical measures to quantify
the quality of a given clustering. The resulting conclusions are pessimistic, since
it is computationally difficult to find an optimal clustering of a given data set, if
we go by any of these popular criteria. In contrast, the practitioners' perspective
is much more optimistic. Our explanation for this disparity of opinions is that
complexity theory concentrates on the worst case, whereas in reality we only care
for data sets that can be clustered well.

We introduce a theoretical framework of clustering in metric spaces that
revolves around a notion of "good clustering". We show that if a good clustering
exists, then in many cases it can be efficiently found. Our conclusion is that
contrary to popular belief, clustering should not be considered a hard task.

Keywords: Cluster Analysis, Hardness of clustering, Theoretical Framework
for clustering, Stability.
1 Introduction



Clustering is the task of partitioning a set of objects in a meaningful way. Notwith- 
standing several recent attempts to develop a theory of clustering (e.g. [1, 4, 9]), our 
foundational understanding of the matter is still quite unsatisfactory. 

The clustering problem deals with a set of objects X that is equipped with some 
additional structure, such as a dissimilarity (or similarity) function $w : X \times X \to [0, \infty)$.
Informally, we are seeking a partition of X into clusters, such that objects are placed 
in the same cluster iff they are sufficiently similar. Here are some concrete popular 
manifestations of this general problem. 

1. A very popular optimization criterion is $k$-means. Aside from $X$ and $w$ one is
given an integer $k$. The goal is to partition $X$ into $k$ parts $C_1, \ldots, C_k$ and find a
center $x_i \in C_i$ in each part so as to minimize $\sum_{i=1}^{k} \sum_{y \in C_i} w^2(y, x_i)$. Other popular
criteria of similar nature are $k$-medians, min-sum and others.

2. Many clustering algorithms work "bottom up". Initially, every singleton in X 
is considered as a separate cluster, and the algorithm proceeds by repeatedly 
merging nearby clusters. Other popular algorithms work "top down": Here we 
start with a single cluster that consists of the whole space. Subsequently, existing 
clusters get split to improve some objective function. 

3. Several successful methods use spectral methods. One associates a matrix (e.g. 
a Laplacian) to (X,w), and partitions X according to the eigenvectors of this 
matrix. 

Approaches to the clustering problem that focus on some objective function usually
result in NP-hard optimization problems. Consequently, most existing theoretical
studies concentrate on designing approximation algorithms for such optimization problems
and proving appropriate hardness results.

However, the practical purpose of clustering is not to optimize such objectives. 
Rather, our goal is to find a meaningful partition of the data (provided, of course, that 
such a partition exists). The point that we advocate is that a satisfactory theory of
clustering should start with a definition of a good clustering and proceed to determine
when a good clustering can be found efficiently. In this paper, we follow this approach
when the underlying space is a metric¹ space.

This perspective leads to conclusions which are at odds with common beliefs regard- 
ing clustering. This applies, in particular, to the computational hardness of clustering. 
The infeasibility of optimizing most of the popular objectives led many theoreticians to
the bleak view that clustering is hard. However, we show that in many circumstances
a good clustering can be efficiently found, leading to the opposite conclusion. From 
the practitioner's viewpoint, "clustering is either easy or pointless" - that is, whenever 

1 The assumption that $d$ is a metric is not too strict. E.g., much of what we do applies even if we
weaken the triangle inequality to $\lambda \cdot d(x, z) \le d(x, y) + d(y, z)$ for $\lambda$ bounded away from zero.
the input admits a good clustering, finding it is feasible. Our analysis provides some
support to this view.

This work is one of several recent attempts to develop a mathematical theory of 
clustering. For more on the relevant literature, see Section 4. 

1.1 A Theoretical Framework for Clustering in Metric Spaces 

There are numerous notions of clusters in data sets and clustering methods to be found 
in the literature. Although not necessarily stated explicitly, these methods are guided 
by an ideal (in the Platonic sense) notion of a good cluster in a space $X$. This is a
subset $C \subseteq X$ such that if $x \in C$ and $y \notin C$, then $x$ is substantially closer to $C$ than $y$
is. To rule out trivialities we usually require $C$ to be big enough. This, in particular,
eliminates the possibility of trivial singleton clusters. Even more emphasis is put on
problems of clustering. Here we seek partitions of the space $X$ into clusters such that
every $x \in X$ is substantially closer to the cluster containing it than to any other cluster.
The problem is specified in terms of a proximity measure $\Delta(x, A)$ between elements
$x \in X$ and subsets $A \subseteq X$. Numerous natural choices for $\Delta(\cdot, \cdot)$ suggest themselves.
For example, if $X$ is a metric space, it is reasonable to define $\Delta(x, A)$ in terms of $x$'s
distances from members of $A$.

In the present paper we consider a metric space $(X, d)$ from which data points are
sampled² according to a probability distribution $P$. The definition we adopt here is
$\Delta(x, A) = \mathbb{E}_{y \sim P}[d(x, y) \mid y \in A]$. Other interesting definitions suggest themselves, e.g.,
$\Delta'(x, A) = \inf_{y \in A \setminus \{x\}} d(x, y)$.

A technical comment: The definition of $\Delta(x, A)$ depends on the distribution $P$. To
simplify notation we omit subscripts such as $P$ when they are clear from the context.

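Formally, we say that $C \subseteq X$ is an $(\alpha, \gamma)$-cluster for $\alpha > 0$, $\gamma > 1$ if $P(C) \ge \alpha$
and for (almost-) every³ $x \in C$, $y \notin C$,

\[ \Delta(y, C) \ge \gamma \cdot \Delta(x, C). \]

Likewise, a partition $\mathcal{C} = \{C_1, \ldots, C_k\}$ of $X$ is an $(\alpha, \gamma)$-clustering for some
$\alpha > 0$, $\gamma > 1$ if

\[ \Delta(x, C_j) \ge \gamma \cdot \Delta(x, C_i) \]

for every $i \ne j$ and (almost-) every $x \in C_i$ and, in addition, $P(C_i) \ge \alpha$ for every $i$.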
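For a finite metric space with the uniform distribution, the two definitions above can be checked directly. The following is a minimal sketch under those assumptions; the function names (`avg_dist`, `is_cluster`, `is_clustering`) and the distance-matrix representation are ours, not the paper's.

```python
def avg_dist(d, x, A):
    """Delta(x, A) under the uniform distribution: mean of d[x][y] over y in A."""
    return sum(d[x][y] for y in A) / len(A)


def is_cluster(d, C, alpha, gamma):
    """Check the (alpha, gamma)-cluster condition for a subset C of range(len(d))."""
    n = len(d)
    if len(C) < alpha * n:                      # P(C) >= alpha
        return False
    worst_inside = max(avg_dist(d, x, C) for x in C)
    outside = [y for y in range(n) if y not in C]
    return all(avg_dist(d, y, C) >= gamma * worst_inside for y in outside)


def is_clustering(d, parts, alpha, gamma):
    """Check the (alpha, gamma)-clustering condition for a partition of range(len(d))."""
    n = len(d)
    for i, Ci in enumerate(parts):
        if len(Ci) < alpha * n:                 # P(C_i) >= alpha
            return False
        for x in Ci:
            own = avg_dist(d, x, Ci)
            for j, Cj in enumerate(parts):
                if j != i and avg_dist(d, x, Cj) < gamma * own:
                    return False
    return True


# Four points on a line, forming two tight pairs.
points = [0, 1, 10, 11]
d = [[abs(p - q) for q in points] for p in points]
print(is_clustering(d, [[0, 1], [2, 3]], 0.5, 19))
print(is_clustering(d, [[0, 2], [1, 3]], 0.5, 2))
```

Note that $\Delta(x, C)$ here averages over all of $C$, including $x$ itself, which matches the definition adopted in the text rather than the variant that excludes $x$.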
A few technical points are in place. 

• We study $(\alpha, \gamma)$-clusterings of a space as well as partitions of a space into
$(\alpha, \gamma)$-clusters. We note that although these two notions are similar, they are not
identical.

2 In certain cases it is inappropriate to assume that points of X are drawn at random. It is also 
possible that we do not know how X is sampled. In such circumstances, we consider P as the uniform 
distribution on X. 

3 Almost means, as usual, that we are allowing an exceptional set of measure zero. 
• Our results hold if we choose instead to define $\Delta(x, A)$ as $\mathbb{E}[d(x, y) \mid y \in A \setminus \{x\}]$.
This definition is perfectly reasonable, but it leads to certain minor technical
complications that the current definition avoids. Moreover, the difference between
the two definitions is rather insignificant, since our main interest is in cases where
$P(\{x\}) \ll P(A)$.

Our main focus here is on efficient algorithms for finding $(\alpha, \gamma)$-clusters and clusterings.
The analysis of these algorithms relies on the structural properties of such clusters. We
can now present our main results. To simplify matters without compromising the big
picture, we state our theorems in the case when $X$ is a given finite metric space.

Theorem 1.1 For every fixed $\gamma > 1$, $\alpha > 0$ there is an algorithm that finds all
$(\alpha, \gamma)$-clusterings of a given finite metric space $X$ and runs in time $\mathrm{poly}(|X|)$.

Theorem 1.2 There is a polynomial time algorithm that on input a finite metric space
$X$ and $\alpha > 0$ finds all $\gamma$-clusters in $X$ with $\gamma > 3$ and a partition of $X$ into
$(\alpha, \gamma)$-clusters with $\gamma > 3$, provided one exists. Moreover, the latter problem is NP-hard for
$\gamma = 5/2$.

1.2 An overview 

Our discussion splits according to the value of the parameter $\alpha$. When $\alpha$ is bounded
away from zero we work by exhaustive sampling (e.g. as in [2]). We first sample a small
set of points $S$ from the space. Since $|S|$ is small (logarithmic in an error parameter), it
is computationally feasible to consider all possible partitions $\Pi$ of $S$. To each partition
$\Pi$ of $S$ we associate a clustering that can be viewed as the corresponding "Voronoi
diagram". If the space has an $(\alpha, \gamma)$-clustering $\mathcal{C}$, let $\Pi^*$ be the partition of $S$ that is
consistent with $\mathcal{C}$. We show that the "Voronoi diagram" of $\Pi^*$ nearly coincides with $\mathcal{C}$
provided that $\gamma$ is bounded away from 1. Concretely, Lemma 2.2 controls the distances
between points that reside in distinct clusters in an $(\alpha, \gamma)$-clustering. Together with
Hoeffding's inequality this yields Lemma 2.3 and Corollary 2.4 which show that the
"Voronoi diagram" of an appropriate partition of a small sample is nearly an
$(\alpha, \gamma)$-clustering. Lemma 2.5 speaks about the collection of all possible $(\alpha, \gamma)$-clusterings of
the space. It shows that every two distinct $(\alpha, \gamma)$-clusterings must differ substantially.
Consequently (Corollary 2.6) there is a bound on the number of $(\alpha, \gamma)$-clusterings that
any space can have. All of this is then used to derive an efficient algorithm that can
find all $(\alpha, \gamma)$-clusterings of the space, proving Theorem 1.1.

In section 3 we deal with the case of small $\alpha$. This affects the analysis, since we
require that the dependency of the algorithm's runtime on $\alpha$ be $\mathrm{poly}(1/\alpha)$. We show that
$(\alpha, 3 + \epsilon)$-clusters are very simple: Such a cluster is a ball and any two such clusters that
intersect are (inclusion) comparable. These structural properties are used to derive an
efficient algorithm that partitions the space into $(\alpha, 3 + \epsilon)$-clusters (provided that such
a partition exists), proving the positive part of Theorem 1.2. To match this result, we
show that finding a partition of the space into $(\alpha, 2.5)$-clusters is NP-Hard, proving
Theorem 1.2 in full.

Lastly, in section 4 we discuss some connection to other work, both old and new, 
as well as some open questions arising from our work. 

2 Clustering into Few Clusters - $\alpha$ is bounded away
from zero

Throughout the section, $X$ is a metric space endowed with a probability measure $P$. To
avoid confusion, other probability measures that are used throughout are denoted by
$\Pr$. We define a metric $d$ between two collections of subsets of $X$, say $\mathcal{C} = \{C_1, \ldots, C_k\}$
and $\mathcal{C}' = \{C'_1, \ldots, C'_k\}$. Namely, $d(\mathcal{C}, \mathcal{C}') = \min_{\sigma} P\left(\bigcup_{i=1}^{k} C_i \oplus C'_{\sigma(i)}\right)$, where $A \oplus B$ denotes
symmetric difference, and the minimum is over all permutations $\sigma \in S_k$. The definition
of $d(\mathcal{C}, \mathcal{C}')$ extends naturally to the case where $\mathcal{C}$ and $\mathcal{C}'$ have $k$ resp. $l$ sets and, say,
$l < k$. The only change is that now $\sigma : [l] \to [k]$ is 1 : 1.
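For small finite examples the distance between two collections can be computed by brute force over all matchings. This is only an illustrative sketch assuming the uniform distribution on $\{0, \ldots, n-1\}$; the function name `clustering_distance` is ours, and the enumeration is exponential in the number of sets.

```python
from itertools import permutations


def clustering_distance(n, C, C_prime):
    """Symmetric-difference distance between two collections of subsets of range(n).

    The smaller collection is matched injectively into the larger one, and the
    measure of the union of the symmetric differences is minimized.
    """
    small = [set(c) for c in C]
    large = [set(c) for c in C_prime]
    if len(small) > len(large):
        small, large = large, small
    best = None
    for sigma in permutations(range(len(large)), len(small)):
        mismatched = set()
        for i, j in enumerate(sigma):
            mismatched |= small[i] ^ large[j]   # symmetric difference
        value = len(mismatched) / n             # uniform measure
        if best is None or value < best:
            best = value
    return best


print(clustering_distance(4, [[0, 1], [2, 3]], [[2, 3], [0, 1]]))
print(clustering_distance(4, [[0, 1], [2, 3]], [[0], [1, 2, 3]]))
```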

We define $\Delta$ also on sets. If $A, B \subseteq X$, we define $\Delta(A, B)$ as the expectation
of $d(x, y)$ where $x$ and $y$ are drawn from the distribution $P$ restricted to $A$ and $B$
respectively. It is easily verified that $\Delta$ is symmetric and satisfies the triangle inequality.
It is usually not a metric, since $\Delta(A, A)$ is usually positive.

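Proposition 2.1 For every $A, B, C \subseteq X$,

\[ \Delta(A, B) = \Delta(B, A) \quad \text{and} \quad \Delta(A, B) \le \Delta(A, C) + \Delta(C, B). \]

As the following lemma shows, distances in an $(\alpha, \gamma)$-clustering are fairly regular.

Lemma 2.2 Let $C_1, \ldots, C_k$ be an $(\alpha, \gamma)$-clustering and let $i \ne j$. Then

1. For almost every $x \in C_i$, $y \in C_j$: $\frac{\gamma - 1}{\gamma} \Delta(y, C_i) \le d(x, y) \le \frac{\gamma^2 + 1}{\gamma(\gamma - 1)} \Delta(y, C_i)$.

2. For almost every $x, y \in C_i$: $d(x, y) \le \frac{2}{\gamma - 1} \cdot \Delta(x, C_j)$.

Proof. Let $x \in C_i$, $y \in C_j$. For the left inequality in part 1, note that

\[
\begin{aligned}
d(x, y) &\ge \Delta(y, C_i) - \Delta(x, C_i) \\
&\ge \Delta(y, C_i) - \frac{1}{\gamma} \cdot \Delta(x, C_j) \\
&\ge \Delta(y, C_i) - \frac{1}{\gamma} \cdot [d(x, y) + \Delta(y, C_j)] \\
&\ge \Delta(y, C_i) - \frac{1}{\gamma} \cdot \left[d(x, y) + \frac{1}{\gamma} \cdot \Delta(y, C_i)\right].
\end{aligned}
\]

For the right inequality,

\[
\begin{aligned}
d(x, y) &\le \Delta(x, C_i) + \Delta(y, C_i) \\
&\le \frac{1}{\gamma} \cdot \Delta(x, C_j) + \Delta(y, C_i) \\
&\le \frac{1}{\gamma} \cdot (d(x, y) + \Delta(y, C_j)) + \Delta(y, C_i) \\
&\le \frac{1}{\gamma} \cdot \left(d(x, y) + \frac{1}{\gamma} \cdot \Delta(y, C_i)\right) + \Delta(y, C_i).
\end{aligned}
\]

For part 2, let $x, y \in C_i$. Then

\[
\begin{aligned}
d(x, y) &\le \Delta(x, C_i) + \Delta(y, C_i) \\
&\le \frac{1}{\gamma} \cdot [\Delta(x, C_j) + \Delta(y, C_j)] \\
&\le \frac{1}{\gamma} \cdot [2 \cdot \Delta(x, C_j) + d(x, y)].
\end{aligned}
\]

In each case, collecting the $d(x, y)$ terms on one side gives the stated bound. □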



Note that for $\gamma \to \infty$ all distances $d(x, y)$ with $x \in C_i$ and $y \in C_j$ are roughly equal
and $d(x_1, x_2) \ll d(x_1, y)$ for all $x_1, x_2 \in C_i$ and $y \in C_j$ with $i \ne j$.
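The regularity of distances can be observed numerically on a small example. The sketch below assumes a finite metric space with uniform distribution and reads the bounds of Lemma 2.2 as $\frac{\gamma-1}{\gamma}\Delta(y, C_i) \le d(x, y) \le \frac{\gamma^2+1}{\gamma(\gamma-1)}\Delta(y, C_i)$ across clusters and $d(x, y) \le \frac{2}{\gamma-1}\Delta(x, C_j)$ within a cluster; the helper names are ours, and exact rational arithmetic is used so that equalities are not lost to rounding.

```python
from fractions import Fraction


def avg(d, x, A):
    """Exact average distance from x to the points of A."""
    return Fraction(sum(d[x][y] for y in A), len(A))


def best_gamma(d, parts):
    """Largest gamma for which the partition satisfies the clustering inequality."""
    ratios = []
    for i, Ci in enumerate(parts):
        for x in Ci:
            for j, Cj in enumerate(parts):
                if j != i:
                    ratios.append(avg(d, x, Cj) / avg(d, x, Ci))
    return min(ratios)


def lemma_2_2_holds(d, parts):
    """Check both parts of the lemma for every pair of distinct clusters."""
    g = best_gamma(d, parts)
    lower, upper, inner = (g - 1) / g, (g * g + 1) / (g * (g - 1)), 2 / (g - 1)
    for i, Ci in enumerate(parts):
        for j, Cj in enumerate(parts):
            if i == j:
                continue
            for x in Ci:
                for y in Cj:   # part 1: x and y in distinct clusters
                    if not lower * avg(d, y, Ci) <= d[x][y] <= upper * avg(d, y, Ci):
                        return False
                for y in Ci:   # part 2: x and y in the same cluster
                    if d[x][y] > inner * avg(d, x, Cj):
                        return False
    return True


points = [0, 1, 2, 10, 11, 12]
d = [[abs(p - q) for q in points] for p in points]
parts = [[0, 1, 2], [3, 4, 5]]
print(best_gamma(d, parts), lemma_2_2_holds(d, parts))
```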

We show next how to recover an $(\alpha, \gamma)$-clustering by sampling. For $x \in X$ and
$A \subseteq X$ finite, we denote the average distance from $x$ to $A$'s elements by $\Delta_U(x, A) :=
\frac{1}{|A|} \sum_{y \in A} d(x, y)$. A finite sample set $S$ provides us with an estimate for the distance
of a point $x$ from a (not necessarily finite) $C \subseteq X$. Namely, we define the empirical
proximity of $x$ to $C$ as $\Delta_{emp}(x, C) := \Delta_U(x, C \cap S)$.

We turn to explain how we recover an unknown $(\alpha, \gamma)$-clustering of $X$ with $\alpha > 0$
and $\gamma > 1$. Consider a collection $\mathcal{C} = \{C_1, \ldots, C_k\}$ of disjoint subsets of $X$. We
define a "Voronoi diagram" corresponding to $S$, denoted $\mathcal{C}^{\gamma} = \{C_1^{\gamma}, \ldots, C_k^{\gamma}\}$. Here

\[ C_i^{\gamma} = \{x \in X : \forall j \ne i,\ \gamma \cdot \Delta_{emp}(x, C_i) < \Delta_{emp}(x, C_j)\}. \]

If $\mathcal{C}$ is an $(\alpha, \gamma)$-clustering of $X$, we expect $\mathcal{C}^{\gamma}$ to be a good approximation of $\mathcal{C}$.

Lemma 2.3 Let $\mathcal{C} = \{C_1, \ldots, C_k\}$ be an $(\alpha, \gamma)$-clustering of $X$. Let $S = \{Z_1, \ldots, Z_m\}$
be an i.i.d. sample with distribution $P$ and let $q \ne p$. Then, for every $x \in C_q$, $\epsilon > 0$,

\[ \Pr\left(\Delta_{emp}(x, C_p) > (\gamma - \epsilon) \cdot \Delta_{emp}(x, C_q)\right) \ge 1 - 3 \exp(\cdots). \]

The proof follows by a standard application of the Hoeffding bound and is deferred to
the appendix.

Corollary 2.4 Let $S = \{Z_1, \ldots, Z_m\}$ be an i.i.d. sample with distribution $P$. Then,
for every $(\alpha, \gamma)$-clustering $\mathcal{C}$,

\[ \Pr\left(d(\mathcal{C}, \mathcal{C}^{\gamma - \delta}) > t\right) \le \frac{3}{t\alpha} \exp(\cdots). \]
Proof. Denote $\mathcal{C} = \{C_1, \ldots, C_k\}$. By lemma 2.3, with $\epsilon = \delta$, we have

\[
\begin{aligned}
\mathbb{E}[d(\mathcal{C}, \mathcal{C}^{\gamma - \delta})] &= \mathbb{E}\left[P\left(\bigcup_{i=1}^{k} C_i \oplus C_i^{\gamma - \delta}\right)\right] \\
&= \sum_{i=1}^{k} \int_{C_i} \Pr(x \notin C_i^{\gamma - \delta})\, dP(x) \\
&\le (k - 1) \cdot 3 \exp(\cdots).
\end{aligned}
\]

Thus, the corollary follows from Markov's inequality and the fact that $k - 1 < k \le \frac{1}{\alpha}$. □

We next turn to investigate the collection of all $(\alpha, \gamma)$-clusterings of the given space.
We observe first that every two distinct $(\alpha, \gamma)$-clusterings must differ substantially.

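Lemma 2.5 If $\mathcal{C}, \mathcal{C}'$ are two $(\alpha, \gamma)$-clusterings with $d(\mathcal{C}, \mathcal{C}') > 0$, then

\[ d(\mathcal{C}, \mathcal{C}') \ge \frac{\alpha (\gamma - 1)^2}{2\gamma^2 - \gamma + 1}. \]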

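Proof. Denote $\mathcal{C} = \{C_1, \ldots, C_k\}$, $\mathcal{C}' = \{C'_1, \ldots, C'_{k'}\}$ and $\epsilon = d(\mathcal{C}, \mathcal{C}')$. By adding
empty clusters if needed, we can assume that $k = k'$. By reordering the clusters, if
necessary, we can assume that $P\left(\bigcup_{i=1}^{k} C_i \oplus C'_i\right) = \epsilon$ and $P(C'_1 \oplus C_1) > 0$. Again by
selecting the ordering we can assume the existence of some point $x$ that is in $C'_1 \setminus C_1$
and in $C_2 \setminus C'_2$.

\[
\begin{aligned}
\Delta(x, C'_1) &= \frac{1}{P(C'_1)} \int_{C'_1} d(x, y)\, dP(y) \\
&\ge \left(1 - \frac{\epsilon}{\alpha}\right) \cdot \Delta(x, C_1) - \frac{\epsilon}{\alpha} \cdot \max_{y \in C_1 \setminus C'_1} d(x, y) \qquad (1) \\
&\ge \left(1 - \frac{\epsilon}{\alpha} - \frac{\epsilon}{\alpha} \cdot \frac{\gamma^2 + 1}{\gamma(\gamma - 1)}\right) \cdot \Delta(x, C_1).
\end{aligned}
\]

For the second inequality note that $\frac{P(C_1)}{P(C'_1)} \ge \frac{P(C'_1) - P(C'_1 \setminus C_1)}{P(C'_1)} \ge 1 - \frac{\epsilon}{\alpha}$. The third inequality
follows from lemma 2.2.

Since $x \in C_2$ we have $\Delta(x, C_1) \ge \gamma \cdot \Delta(x, C_2)$, so, as we just saw,
$\frac{\Delta(x, C'_1)}{\Delta(x, C_2)} \ge \left(1 - \frac{\epsilon}{\alpha} \cdot \frac{2\gamma^2 - \gamma + 1}{\gamma(\gamma - 1)}\right) \cdot \gamma$. The same argument yields as well
$\frac{\Delta(x, C_2)}{\Delta(x, C'_1)} \ge \left(1 - \frac{\epsilon}{\alpha} \cdot \frac{2\gamma^2 - \gamma + 1}{\gamma(\gamma - 1)}\right) \cdot \gamma$. Consequently
$1 \ge \left(1 - \frac{\epsilon}{\alpha} \cdot \frac{2\gamma^2 - \gamma + 1}{\gamma(\gamma - 1)}\right) \cdot \gamma$, which proves
the lemma. □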
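The last step of the proof is a short piece of algebra. Spelled out with the shorthand $t = \frac{\epsilon}{\alpha} \cdot \frac{2\gamma^2 - \gamma + 1}{\gamma(\gamma - 1)}$ (our notation, introduced only for this remark, with $\epsilon = d(\mathcal{C}, \mathcal{C}')$), the inequality $1 \ge (1 - t)\gamma$ rearranges as follows:

```latex
\[
(1 - t)\,\gamma \le 1
\;\Longleftrightarrow\;
t \ge \frac{\gamma - 1}{\gamma}
\;\Longleftrightarrow\;
\epsilon \ge \alpha \cdot \frac{\gamma - 1}{\gamma} \cdot \frac{\gamma(\gamma - 1)}{2\gamma^2 - \gamma + 1}
= \frac{\alpha(\gamma - 1)^2}{2\gamma^2 - \gamma + 1}.
\]
```

Here the identity $\gamma(\gamma - 1) + (\gamma^2 + 1) = 2\gamma^2 - \gamma + 1$ explains where the denominator comes from.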

As we observe next, for every $\alpha > 0$ and $\gamma > 1$ the number of $(\alpha, \gamma)$-clusterings
that any space can have does not exceed $f(\alpha, \gamma)$, where $f$ depends only on $\alpha$ and $\gamma$ but
not on the space. We find this somewhat surprising, although the proof is fairly easy.

Corollary 2.6 There is a function $f = f(\alpha, \gamma)$ defined for $\alpha > 0$ and $\gamma > 1$
with the following property. The number of $(\alpha, \gamma)$-clusterings of any metric
probability space $X$ is at most $f(\alpha, \gamma)$. This works in particular with $f(\alpha, \gamma) = 2\, \cdot$

( v^ 7 ( 7 2 + l) \ 2 , ,1s 

i2(2 7 2 - 7 +i) U (7-p* a 7 ^ 
Q 2( 7 _i)2 y 

Proof. Consider the following experiment. We take an i.i.d. sample $Z_1, \ldots, Z_m$ of
points from the distribution $P$ with

+ 1)V , /12(27 2 -7 + l) 



m > — ; tt^ — • In 



( 7 -l) 2 a J \ a 2 ( 7 -l) 2 

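and partition them randomly into $k \le \frac{1}{\alpha}$ parts $T_1, \ldots, T_k$. This induces a partition
$\mathcal{C}^* = \{C_1, \ldots, C_k\}$ of the space $X$ defined by

\[ C_i = \{x \in X : \forall j \ne i,\ \Delta_U(x, T_i) < \Delta_U(x, T_j)\}. \]

For every $(\alpha, \gamma)$-clustering $\mathcal{C}$ of $X$ we consider the event $A_{\mathcal{C}}$ that the induced partition
of $X$ satisfies $d(\mathcal{C}, \mathcal{C}^*) < \frac{\alpha (\gamma - 1)^2}{2 \cdot (2\gamma^2 - \gamma + 1)}$. Let us consider the events $A_{\mathcal{C}}$ over distinct
$(\alpha, \gamma)$-clusterings of the space. By Lemma 2.5, these events $A_{\mathcal{C}}$ are disjoint. Now
consider the event $B$ that the $T_i$'s are consistent with $\mathcal{C}$. There are at most $(\frac{1}{\alpha})^m$ ways
to partition the sampled points into $\frac{1}{\alpha}$ parts or less, so that $\Pr(B) \ge \alpha^m$. By the choice
of $m$ and by Corollary 2.4, $\Pr(A_{\mathcal{C}} \mid B) \ge \frac{1}{2}$. Thus, $\Pr(A_{\mathcal{C}}) \ge \Pr(B) \cdot \Pr(A_{\mathcal{C}} \mid B) \ge \frac{1}{2} \alpha^m$.
Consequently, $X$ has at most $f(\alpha, \gamma) = 2 (\frac{1}{\alpha})^m$ distinct $(\alpha, \gamma)$-clusterings, as claimed.
□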

Note 2.7 Fix $\alpha > 0$. The number of $(\alpha, \gamma)$-clusterings might be quite large when $\gamma$ is
close to 1. For example, let $X$ be an $n$-point space, with uniform metric and uniform
probability measure. Every partition in which each part has cardinality $\ge \alpha \cdot n$ is an
$(\alpha, \frac{n}{n-1})$-clustering⁴.
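The uniform-metric example can be checked by brute force on a tiny space. In the sketch below the value $\gamma = \frac{n}{n-1}$ comes from a direct computation: a point in a part of size $s$ has average distance $\frac{s-1}{s}$ to its own part and $1$ to every other part, and $\frac{s}{s-1} \ge \frac{n}{n-1}$. The helper names and the choice $n = 6$, $\alpha = \frac{1}{3}$ are ours.

```python
from fractions import Fraction
from itertools import product


def avg(d, x, A):
    """Exact average distance from x to the points of A (x included if x is in A)."""
    return Fraction(sum(d[x][y] for y in A), len(A))


def is_clustering(d, parts, alpha, gamma):
    """Exact check of the (alpha, gamma)-clustering condition, uniform distribution."""
    n = len(d)
    for Ci in parts:
        if Fraction(len(Ci), n) < alpha:
            return False
        for x in Ci:
            for Cj in parts:
                if Cj is not Ci and avg(d, x, Cj) < gamma * avg(d, x, Ci):
                    return False
    return True


def all_partitions(n, max_parts):
    """All set partitions of range(n) into at most max_parts blocks."""
    seen = set()
    for labels in product(range(max_parts), repeat=n):
        blocks = {}
        for x, lab in enumerate(labels):
            blocks.setdefault(lab, []).append(x)
        seen.add(frozenset(frozenset(b) for b in blocks.values()))
    return seen


n = 6
d = [[0 if i == j else 1 for j in range(n)] for i in range(n)]   # uniform metric
alpha, gamma = Fraction(1, 3), Fraction(n, n - 1)
good = [p for p in all_partitions(n, 3)
        if is_clustering(d, [sorted(b) for b in p], alpha, gamma)]
print(len(good))
```

With these parameters the valid clusterings are exactly the partitions whose parts all have at least two points, which already gives dozens of clusterings on six points.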



4 Note that this example is not valid if we define $\Delta(x, A) = \mathbb{E}[d(x, y) \mid y \in A \setminus \{x\}]$. To overcome
this point, we can replace every point $x \in X$ by many copies, where two copies of $x$ are at distance $\epsilon$
and a copy of $x$ and a copy of $y \ne x$ are at distance $d(x, y)$.
Algorithmic Aspects

Fix $\alpha > 0$, $\gamma > 1$. We shall now show that an $(\alpha, \gamma)$-clustering can be well approximated
efficiently. By Corollary 2.4, an $(\alpha, \gamma)$-clustering can be approximated by a small sample,
where the approximation is with respect to the symmetric difference metric. A major
flaw of this approximation scheme is that we have no verification method to accompany
it. We do not know how to check whether a given partition is close to an
$(\alpha, \gamma)$-clustering w.r.t. the symmetric difference metric. To this end, we introduce another
notion of approximation. A family of subsets of $X$, $\mathcal{C} = \{C_1, \ldots, C_k\}$, is an
$(\alpha, \gamma, \epsilon)$-clustering if

• For every $i \in [k]$, $P(C_i) \ge \alpha$.

• There is a set $N \subseteq X$ with $P(N) \le \epsilon$ such that every $x \in X \setminus N$ belongs to
exactly one $C_i$ and for every $j \ne i$, $\Delta(x, C_j) \ge \gamma \cdot \Delta(x, C_i)$.

We consider next a partition that is attained by the method of Corollary 2.4. We
show that if it is $\epsilon$-close to an $(\alpha, \gamma)$-clustering w.r.t. symmetric differences, then it is
necessarily an $(\alpha - \epsilon, \gamma - O(\epsilon), \epsilon)$-clustering.

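We associate with every collection $\mathcal{A} = \{A_1, \ldots, A_k\}$ of finite subsets⁵ of $X$ the
following collection of subsets $\mathcal{C}^{\gamma}(\mathcal{A}) = \{C_1^{\gamma}(\mathcal{A}), \ldots, C_k^{\gamma}(\mathcal{A})\}$:

\[ C_i^{\gamma}(\mathcal{A}) = \{x \in X : \forall j \ne i,\ \gamma \cdot \Delta_U(x, A_i) < \Delta_U(x, A_j)\} \qquad (2) \]

where, as above, $\Delta_U(x, A) := \frac{1}{|A|} \sum_{z \in A} d(x, z)$.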
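The construction in Equation (2) is straightforward to compute on a finite space. Here is a minimal sketch, assuming points are indexed $0, \ldots, n-1$ and distances are given as a matrix; the names `mean_dist` and `voronoi_cells` are ours.

```python
def mean_dist(d, x, A):
    """Delta_U(x, A): average distance from x to the elements of the finite set A."""
    return sum(d[x][z] for z in A) / len(A)


def voronoi_cells(d, groups, gamma):
    """Cells of Equation (2): x joins cell i when gamma * Delta_U(x, A_i) is
    strictly smaller than Delta_U(x, A_j) for every other group j."""
    cells = [[] for _ in groups]
    for x in range(len(d)):
        avgs = [mean_dist(d, x, A) for A in groups]
        for i in range(len(groups)):
            if all(gamma * avgs[i] < avgs[j] for j in range(len(groups)) if j != i):
                cells[i].append(x)
    return cells


points = [0, 1, 2, 10, 11, 12]
d = [[abs(p - q) for q in points] for p in points]
groups = [[0, 2], [3, 5]]                 # two representatives from each side
print(voronoi_cells(d, groups, 1))
print(voronoi_cells(d, groups, 20))
```

For large $\gamma$ the strict inequality may hold for no index at all, so the cells need not cover the space; this is why the cells form a collection of disjoint subsets rather than a partition in general.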

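Proposition 2.8 Let $\mathcal{C} = \{C_1, \ldots, C_k\}$ be an $(\alpha, \gamma)$-clustering. Let $\mathcal{A} = \{A_1, \ldots, A_k\}$
where $\forall i$, $A_i \subseteq C_i$ and $d(\mathcal{C}^{\gamma}(\mathcal{A}), \mathcal{C}) \le \epsilon$. Then $\mathcal{C}^{\gamma}(\mathcal{A})$ is an $(\alpha - \epsilon, \gamma - O(\epsilon), \epsilon)$-clustering.
The unspecified coefficients in the $O$-term depend on $\alpha$ and $\gamma$.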

The main idea of the proof is rather simple: The assumption $d(\mathcal{C}^{\gamma}(\mathcal{A}), \mathcal{C}) \le \epsilon$ implies
that for all $i$ the set $C_i \oplus C_i^{\gamma}(\mathcal{A})$ is small. This suggests that $\Delta(x, C_i) \approx \Delta(x, C_i^{\gamma}(\mathcal{A}))$ for
most points $x \in X$. The only difficulty in realizing this idea is that points in $C_i \oplus C_i^{\gamma}(\mathcal{A})$
might have a large effect on either $\Delta(x, C_i)$ or $\Delta(x, C_i^{\gamma}(\mathcal{A}))$. But the assumption that
$A_i \subseteq C_i$ gives us control over the distances between $x$ and these points. The full proof
can be found in the appendix.

To recap, the above discussion suggests a randomized algorithm that for a given
$\epsilon > 0$ runs in time $\mathrm{poly}(\frac{1}{\epsilon})$ and finds w.h.p. an $(\alpha - \epsilon, \gamma - O(\epsilon), \epsilon)$-clustering of $X$
provided that $X$ has an $(\alpha, \gamma)$-clustering $\mathcal{C}$. We take $m = \Theta(\log(\frac{1}{\epsilon}))$ i.i.d. samples
from $X$ and go over all possible partitions of the sample points into at most $\frac{1}{\alpha}$ sets.
There are only $(\frac{1}{\alpha})^{O(\log(\frac{1}{\epsilon}))}$ such partitions. We next check whether the clustering of
$X$ that is induced as in Equation (2) is an $(\alpha - \epsilon, \gamma - O(\epsilon), \epsilon)$-clustering (this can be
easily done by standard statistical estimates).
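The recap above can be turned into a small brute-force sketch for finite spaces. This is not the paper's algorithm verbatim: it assumes the uniform distribution, takes the sample as an explicit argument instead of drawing it at random, assigns each point to the sample group with the smallest average distance (ties go to the first group), and verifies the exact $(\alpha, \gamma)$-clustering condition rather than an approximate one. All function names are ours.

```python
import math
from itertools import product


def avg_dist(d, x, A):
    return sum(d[x][y] for y in A) / len(A)


def is_clustering(d, parts, alpha, gamma):
    """Exact (alpha, gamma)-clustering check under the uniform distribution."""
    n = len(d)
    for i, Ci in enumerate(parts):
        if len(Ci) < alpha * n:
            return False
        for x in Ci:
            own = avg_dist(d, x, Ci)
            for j, Cj in enumerate(parts):
                if j != i and avg_dist(d, x, Cj) < gamma * own:
                    return False
    return True


def induced_partition(d, groups):
    """Assign every point to the sample group with the smallest average distance."""
    parts = [[] for _ in groups]
    for x in range(len(d)):
        avgs = [avg_dist(d, x, A) for A in groups]
        parts[avgs.index(min(avgs))].append(x)
    return [p for p in parts if p]


def find_clusterings(d, alpha, gamma, sample):
    """Try every labelling of the sample with at most 1/alpha labels and keep the
    induced partitions that pass the clustering check."""
    k_max = math.floor(1 / alpha + 1e-9)
    found = set()
    for labels in product(range(k_max), repeat=len(sample)):
        groups = {}
        for z, lab in zip(sample, labels):
            groups.setdefault(lab, []).append(z)
        parts = induced_partition(d, list(groups.values()))
        if is_clustering(d, parts, alpha, gamma):
            found.add(frozenset(frozenset(p) for p in parts))
    return found


points = [0, 1, 2, 10, 11, 12]
d = [[abs(p - q) for q in points] for p in points]
found = find_clusterings(d, alpha=0.5, gamma=3, sample=[0, 2, 3, 5])
print(sorted(sorted(sorted(p) for p in c) for c in found))
```

The number of labellings is $k^{m}$ for a sample of size $m$ and $k = \lfloor 1/\alpha \rfloor$ labels, which is why the sample size must stay logarithmic for the enumeration to remain polynomial.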

5 In fact, we will allow $A_1, \ldots, A_k$ to have multiple points. Formally, then, $A_1, \ldots, A_k$ are multisets.
To see that the algorithm accomplishes what it should, note that the failure
probability in corollary 2.4 with $\delta = \gamma - 1$ can be $\le \frac{1}{2}$ for $m = O(\log(\frac{1}{\epsilon}))$. Thus, w.p. $\ge \frac{1}{2}$ one
of the considered partitions induces a partition of $X$ which is $\epsilon$-close in the symmetric
difference sense to $\mathcal{C}$. By Proposition 2.8, this partition is an
$(\alpha - \epsilon, \gamma - O(\epsilon), \epsilon)$-clustering.

This also proves Theorem 1.1: If our input is a finite metric space $X$, we can apply
the above algorithm with $\epsilon = \frac{1}{|X| + 1}$ and examples that are being sampled from $X$
uniformly at random. As explained, w.h.p., the algorithm will consider every partition
which is $\epsilon$-close in the symmetric difference sense to any of $X$'s $(\alpha, \gamma)$-clusterings.
However, since $\epsilon = \frac{1}{|X| + 1} < \frac{1}{|X|}$, two $\epsilon$-close partitions must be identical. This proves Theorem
1.1.

Note that by corollary 2.6, all the $(\alpha, \gamma)$-clusterings can be approximated. A similar
algorithm can efficiently find an approximate $(\alpha, \gamma, \epsilon)$-clustering, provided that one
exists⁶. Also, similar techniques yield an algorithm to approximate an individual
$(\alpha, \gamma)$-cluster.

3 Clustering into Many Clusters 

To simplify matters we consider only finite metric spaces endowed with a uniform
probability distribution⁷.

Lemma 3.1 Let $X$ be a metric space and let $\epsilon > 0$.

1. Let $C_1, C_2 \subseteq X$ be two $(3 + \epsilon)$-clusters. Then $C_1 \cap C_2 = \emptyset$, $C_1 \subseteq C_2$ or $C_2 \subseteq C_1$.

2. Every $(3 + \epsilon)$-cluster is a ball around one of its points.

3. The claim is sharp and the above claims need not hold for $\epsilon = 0$.

Proof. We prove the first claim by contradiction and assume that $P(C_1 \setminus C_2)$, $P(C_2 \setminus C_1)$,
$P(C_1 \cap C_2)$ are positive. Let $x \in C_1 \cap C_2$, $y \in C_1 \oplus C_2$ be such that $d(x, y)$ is as
small as possible. Say that $y \in C_2$. Clearly, $\Delta(x, C_1 \setminus C_2) \ge d(x, y)$.

6 The main difference is that here we do not consider partitions of the whole sample set. Rather, we
seek first those sample points that belong to the exceptional set, and only partitions of the remaining
sample points are considered.

7 As in the previous section, it's a fairly easy matter to accommodate general metric spaces and 
arbitrary probability distributions. 






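We first deal with the case $P(C_1 \setminus C_2) \ge P(C_1 \cap C_2)$, and arrive at a contradiction
as follows:

\[
\begin{aligned}
\Delta(x, C_1) &\ge \frac{1}{2} \Delta(x, C_1 \setminus C_2) \\
&\ge \frac{1}{2} [\Delta(y, C_1) - \Delta(x, C_1)] \\
&\ge \frac{2 + \epsilon}{2} \Delta(x, C_1).
\end{aligned}
\]

When $P(C_1 \setminus C_2) < P(C_1 \cap C_2)$, a contradiction is reached as follows. By the choice
of $x, y$, for every $z \in C_1 \setminus C_2$, there holds $\Delta(z, C_1 \cap C_2) \ge d(x, y)$. Therefore,

\[
\begin{aligned}
\Delta(z, C_1) &= \frac{P(C_1 \setminus C_2)}{P(C_1)} \Delta(z, C_1 \setminus C_2) + \frac{P(C_1 \cap C_2)}{P(C_1)} \Delta(z, C_1 \cap C_2) \\
&\ge \frac{1}{2} \Delta(z, C_1 \cap C_2) \\
&\ge \frac{1}{2} [\Delta(y, C_1) - \Delta(x, C_1)] \\
&\ge \frac{1}{2} \left(1 - \frac{1}{3 + \epsilon}\right) \cdot \Delta(y, C_1) \\
&\ge \frac{1}{2} \left(1 - \frac{1}{3 + \epsilon}\right) (3 + \epsilon) \cdot \Delta(z, C_1).
\end{aligned}
\]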

To prove the second part, let $C$ be a $(3 + \epsilon)$-cluster of diameter $r$, and let $x, y \in C$
satisfy $d(x, y) = r$. Since $d(x, y) \le \Delta(x, C) + \Delta(y, C)$, we may assume w.l.o.g. that
$\Delta(x, C) \ge \frac{r}{2}$. We show now that $C = B(x, r)$ and $C$ is a ball, as claimed. Indeed
$d(x, z) \le r$ for every $z \in C$, and if $z \notin C$, then $d(x, z) \ge \Delta(z, C) - \Delta(x, C) \ge
(3 + \epsilon - 1) \Delta(x, C) > d(x, y) = r$. The conclusion follows.

To show that the result is sharp, consider the graph G that is a four-vertex cycle 
and its graph metric. It is not hard to check that every two consecutive vertices in G 
constitute a 3-cluster which is not a ball. Moreover a pair of intersecting edges in G 
yield an example for which the first part of the lemma fails to hold. □ 
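The four-cycle example is small enough to verify mechanically. The sketch below assumes the uniform distribution and closed balls $B(x, r) = \{z : d(x, z) \le r\}$; the helper names are ours. Sums of distances are compared instead of averages, which is equivalent because both sides are divided by the same $|C|$.

```python
def sum_dist(d, x, C):
    return sum(d[x][z] for z in C)


def is_gamma_cluster(d, C, gamma):
    """Check Delta(y, C) >= gamma * Delta(x, C) for all x in C and y outside C."""
    worst_inside = max(sum_dist(d, x, C) for x in C)
    return all(sum_dist(d, y, C) >= gamma * worst_inside
               for y in range(len(d)) if y not in C)


def closed_balls(d):
    """All closed balls B(x, r) with r ranging over the distances from x."""
    n = len(d)
    return {frozenset(z for z in range(n) if d[x][z] <= r)
            for x in range(n) for r in d[x]}


n = 4
d = [[min(abs(i - j), n - abs(i - j)) for j in range(n)] for i in range(n)]  # graph metric of the 4-cycle
edges = [{i, (i + 1) % n} for i in range(n)]

print(all(is_gamma_cluster(d, e, 3) for e in edges))        # every edge is a 3-cluster
print(any(frozenset(e) in closed_balls(d) for e in edges))  # is any edge a ball?
```

The edges $\{0, 1\}$ and $\{1, 2\}$ intersect without being comparable, so both structural claims of the lemma fail at $\gamma = 3$, while no edge remains a cluster for any $\gamma > 3$.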

An $(\alpha, \gamma)$-cluster in a space $X$ is called minimal if it contains no $(\alpha, \gamma)$-cluster
other than itself. Such clusters are of interest, since they can be viewed as "atoms" in
clustering $X$.

Corollary 3.2 For every $\alpha, \epsilon > 0$ and every space $X$ there is at most one partition of
$X$ into minimal $(\alpha, 3 + \epsilon)$-clusters.

To see this, consider two $(\alpha, 3 + \epsilon)$-clusters $C$ and $C'$ that belong to two different such
partitions and have a nonempty intersection. By Lemma 3.1, they must be comparable.
By the minimality assumption, $C = C'$, which proves the claim.

Note 3.3 We note that the previous Corollary may fail badly without the minimality
assumption. Let $X = \{x_1, \ldots, x_n\} \cup \{y_1, \ldots, y_n\}$, where $d(x_i, y_i) = 1$ for all $i$ and all
other distances equal $\gamma$. It is not hard to see that the following are $(\alpha, \gamma)$-clusters in $X$
where $\alpha = \frac{1}{2n}$: A singleton and a pair $\{x_i, y_i\}$. There are $2^n = 2^{\frac{1}{2\alpha}}$ ways to partition
$X$ into such clusters.

Algorithmic Aspects 

We next discuss several algorithmic aspects of clustering into arbitrarily many clusters.
Our input consists of a finite metric space $X$ and the parameter $\alpha > 0$. Lemma 3.1
suggests an algorithm for finding $(\alpha, 3 + \epsilon)$-clusters and for partitioning the space into
$(\alpha, 3 + \epsilon)$-clusters. The runtime of this algorithm is polynomial in $|X|$, and independent
of $\alpha$. The second part of the lemma suggests how to find all the $(\alpha, 3 + \epsilon)$-clusters. As
the first part of the lemma shows, the inclusion relation among the $(\alpha, 3 + \epsilon)$-clusters
has a tree structure. Thus, we can use dynamic programming to find a partition of
the space into $(\alpha, 3 + \epsilon)$-clusters, provided that such a partition exists. This proves the
positive part of Theorem 1.2.
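One simple way to exploit the two structural facts is sketched below for a finite space with the uniform distribution. It enumerates all closed balls, keeps those that are $(\alpha, \gamma)$-clusters, and returns the inclusion-maximal ones if they cover the space. Two assumptions here are ours and not taken from the text: the trivial cluster $X$ itself is excluded, and taking inclusion-maximal clusters stands in for the dynamic programming step (for a laminar family the maximal members are pairwise disjoint, so they form a partition whenever every point lies in some cluster). The sketch is meant for $\gamma > 3$, where Lemma 3.1 applies.

```python
from fractions import Fraction


def sum_dist(d, x, C):
    return sum(d[x][z] for z in C)


def is_cluster(d, C, alpha, gamma):
    """(alpha, gamma)-cluster check; sums are compared since |C| cancels out."""
    n = len(d)
    if len(C) < alpha * n:
        return False
    worst_inside = max(sum_dist(d, x, C) for x in C)
    return all(sum_dist(d, y, C) >= gamma * worst_inside
               for y in range(n) if y not in C)


def candidate_balls(d):
    """All closed balls around every point, one per distinct radius."""
    n = len(d)
    return {frozenset(z for z in range(n) if d[x][z] <= r)
            for x in range(n) for r in set(d[x])}


def partition_into_clusters(d, alpha, gamma):
    """Return a partition into proper (alpha, gamma)-clusters, or None."""
    n = len(d)
    clusters = [B for B in candidate_balls(d)
                if len(B) < n and is_cluster(d, B, alpha, gamma)]
    maximal = [C for C in clusters if not any(C < D for D in clusters)]
    covered = set().union(*maximal) if maximal else set()
    if len(covered) == n and sum(len(C) for C in maximal) == n:
        return sorted(sorted(C) for C in maximal)
    return None


points = [0, 1, 10, 11, 30]
d = [[abs(p - q) for q in points] for p in points]
print(partition_into_clusters(d, Fraction(1, 5), Fraction(7, 2)))
print(partition_into_clusters(d, Fraction(2, 5), Fraction(7, 2)))
```

There are at most $|X|^2$ candidate balls and each check costs $O(|X|^2)$ distance lookups, so the whole sketch runs in time polynomial in $|X|$ and does not depend on $\alpha$.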

To match the above positive result, we show 

Theorem 3.4 The following problems are NP-Hard.

1. $(\alpha, 2.5)$-CLUSTERING: Given an $n$-point metric space $X$ and $\alpha > 0$, decide
whether $X$ has an $(\alpha, 2.5)$-clustering.

2. PARTITION-INTO-$(\alpha, 2.5)$-CLUSTERS: Given an $n$-point metric space $X$ and
$\alpha > 0$, decide whether $X$ has a partition into $(\alpha, 2.5)$-clusters.

The proof of this Theorem, which also proves the negative part of Theorem 1.2, is 
deferred to the appendix. 

4 Conclusion 

4.1 Relation to other work 

As we explain below, our work is inspired by the classical VC/PAC theory. In addition 
we refer to several recent papers that contribute to the development of a theory of 
clustering. 
VC/PAC theory

The VC/PAC setting offers the following formal description of the classification
problem. We are dealing with a space $X$ of instances. The problem is to recover an
unknown member $h^*$ in a known class $\mathcal{H}$ of hypotheses. Here $\mathcal{H} \subseteq \mathcal{Y}^X$, where $\mathcal{Y}$
is a finite set of labels. We seek to recover the unknown $h^*$ by observing a sample
$S = \{(x_i, h^*(x_i))\}_{i=1}^{m} \subseteq X \times \mathcal{Y}$. These samples come from some fixed but unknown
distribution over $X$.

Our description of the clustering problem is similar. We consider a space $X$ of
instances and a class $\mathcal{G}$ of good clusterings $(P, \mathcal{C})$ of $X$, where $P$ is a probability measure
over $X$ and $\mathcal{C}$ is a partition of $X$. We are given a sample $\{X_1, \ldots, X_m\} \subseteq X$ that comes
from some unknown $P$, where $(P, \mathcal{C}) \in \mathcal{G}$ for some partition $\mathcal{C}$, and our purpose is to
recover $\mathcal{C}$. Specifically, here $X$ is a metric space, $\mathcal{G}$ is the class of probability measures
$P$ that admit a partition which is an $(\alpha, \gamma)$-clustering and the corresponding partition
is the associated $(\alpha, \gamma)$-clustering.

Both theories seek conditions on $\mathcal{G}$ or $\mathcal{H}$ under which there are no information
theoretic or computational obstacles that keep us from performing the above mentioned
tasks.

Alternative Notions of Good Clustering 

Our approach is somewhat close in spirit to [4], see also [6]. These papers assume
that the space under consideration has a clustering with some structural properties,
and show how to find it efficiently. In particular, a key notion in these papers is the
$\gamma$-average attraction property, which is conceptually similar to our notion of
$\gamma$-clustering. Given a partition $\mathcal{C} = \{C_1, \ldots, C_k\}$ of a space $X$ it is possible to compare
between clusters either additively or through multiplication. In [4] the requirement is
that $\Delta(x, C_i) + \gamma \le \Delta(x, C_j)$ for every $x \in C_i$ and $j \ne i$, whereas our condition is
$\Delta(x, C_i) \cdot \gamma \le \Delta(x, C_j)$. A clear advantage of our notion is its scale invariance. On the
other hand, their algorithms work even if $X$ is not a metric space and is only endowed
with an arbitrary dissimilarity function.

We mention two more papers that share a similar spirit. Consider a data set that 
resides in the unit ball of a Hilbert Space. It is shown in [8] how to efficiently find a 
large margin classifier for the data provided that one exists. In [1] several additional 
possible notions of good clustering are introduced and analyzed. 

Stability 

The notion of instance stability was introduced in [5] (see also [3]). An instance of
an optimization problem is called stable if the optimal solution does not change (or
changes only slightly) upon a small perturbation of the input. The point is made that
instances of clustering problems are of practical interest only if they are stable. The
notion of an $(\alpha, \gamma)$-clustering has a similar stability property. Namely, if we slightly
perturb a metric, an $(\alpha, \gamma)$-clustering is still an $(\alpha', \gamma')$-clustering for $\alpha' \approx \alpha$, $\gamma' \approx \gamma$.
Thus, a good clustering remains a good clustering under a slight perturbation of the
input.

In fact, the present paper is an outgrowth of our work on stable instances for
MAXCUT, which we view as a clustering problem. We recall that the input to the
MAXCUT problem is an $n \times n$ nonnegative symmetric matrix $W$. We seek an $S \subseteq [n]$
which maximizes $\sum_{i \in S,\, j \notin S} w_{ij}$. Even the METRIC-MAXCUT problem (i.e., when $w_{ij}$ form
a metric) is NP-Hard [ ]. We say that $W'$ is a $\gamma$-perturbation of $W$ for some $\gamma > 1$ if
$\forall i, j$, $\gamma^{-\frac{1}{2}} w_{ij} \le w'_{ij} \le \gamma^{\frac{1}{2}} w_{ij}$. The instance $W$ of MAXCUT is called $\gamma$-stable if the
optimal solution $S$ for $W$ coincides with the optimal solution for every $\gamma$-perturbation
$W'$ of $W$. The methods presented in this paper can be used to give, for every $\epsilon > 0$,
an efficient algorithm that correctly solves all $(1 + \epsilon)$-stable instances of
METRIC-MAXCUT.

These developments will be elaborated in a future publication. 
4.2 Future Work and Open Questions 

In view of this article and papers such as [1, 8, i] it is clear that there is still much 
interest in new notions of a good clustering and the relevant algorithms. Still, on the 
subjects discussed here several natural questions remain open. 

1. We believe that it should be possible to improve the dependence on $\alpha$ and $\gamma$ of
the run time of the algorithm in Theorem 1.1.

2. We gave an efficient method for partitioning a space into $(3 + \epsilon)$-clusters, and showed
(theorem 3.4) that it is NP-Hard to find a partition into 2.5-clusters. Can this
gap be closed?

3. As Lemma 3.1 shows, $(3 + \epsilon)$-clusters are just balls. It is not hard to see that
Lemma 2.3 implies that given an $(\alpha, \gamma)$-clustering of an $n$-point metric space, it
is possible to find $O(\log n)$ representative points in every cluster so that the
clustering is nothing but the Voronoi diagram of the (bunched) representative sets.
Presumably, there is still some interesting structural theory of $(\alpha, \gamma)$-clustering
waiting to be discovered here. Specifically, can the above $O(\log n)$ be replaced
by $O(1)$? A positive answer would give a deterministic version of our algorithm
from section 2, with no dependency on $\alpha$, but only on the maximal number of
clusters.

4. Consider the following statement "Every $n$-point metric space $X$ has a partition
$X = A \sqcup B$ such that for every $x \in A$, $y \in B$, it holds that $\gamma(n) \cdot \Delta(x, A) \le \Delta(x, B)$
and $\gamma(n) \cdot \Delta(y, B) \le \Delta(y, A)$". How large can $\gamma(n)$ be for this statement to be
true?
A Proofs omitted from the text

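Proof. (of Lemma 2.3) For $A \subseteq X$, denote $I_A = |\{j : Z_j \in A\}|$, $J_A =
\sum_{j : Z_j \in A} d(x, Z_j)$. For every $j \in [m]$ define

\[
Y_j = \begin{cases}
\frac{1}{P(C_p)} \cdot d(x, Z_j) & Z_j \in C_p \\
-\frac{\gamma - \frac{\epsilon}{2}}{P(C_q)} \cdot d(x, Z_j) & Z_j \in C_q \\
0 & \text{otherwise}
\end{cases}
\]

We have $\mathbb{E} Y_j = \Delta(x, C_p) - (\gamma - \frac{\epsilon}{2}) \cdot \Delta(x, C_q) \ge \frac{\epsilon}{2\gamma} \cdot \Delta(x, C_p)$. Moreover, by lemma 2.2,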
1^1 < ^ • • A(x, C p ) < £^ ■ A(x, C p ). Thus, by Hoeffding's bound, 



2/ "V"' "9/ — 27 
7 2 -t 

OL 7(7—!) *V.""> ^V) — 0(7- 

2 



1^7(7' 



i). 

Again by Hoeffding's bound, we have



P(C q ) ~ 4 7 



2 



P ( < 1 - — ) < exp [ - f 4^ ) • m 



2 



P ( ~Jtpt\ > 1 + — ) < exp [ - ( 4^ ) • m 



P(C P ) ~ 47, 

Combining the inequalities, we conclude that, with probability > 1 — 

( e(7-l)" N " 



3ex p(- U7(V+i) ) - m 



J _Cp_ J Cp P(C P ) 

!c p P(C P ) ic p 7 - 



> iz*.lzl>i 

□ 
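
The ratio $J_A / I_A$ used in the proof is the empirical average distance from $x$ to the sample points that fall in $A$. As a minimal sketch (the helper names are hypothetical, and the sample is passed in explicitly rather than drawn at random so that the example is reproducible), one can compute it and assign $x$ to the cluster whose sampled points are closest on average:

```python
def empirical_avg_dist(x, sample, cluster, d):
    """Return J_A / I_A for A = cluster, or None if no sample point fell in A."""
    hits = [z for z in sample if z in cluster]
    if not hits:
        return None
    i_a = len(hits)                    # I_A: number of sample points in A
    j_a = sum(d(x, z) for z in hits)   # J_A: total distance from x to them
    return j_a / i_a

def nearest_cluster_by_sample(x, sample, clusters, d):
    """Index of the cluster minimizing the empirical average distance from x."""
    scored = [(empirical_avg_dist(x, sample, c, d), i) for i, c in enumerate(clusters)]
    scored = [(s, i) for s, i in scored if s is not None]
    return min(scored)[1]

# Illustrative data: two well-separated groups of points on the real line.
clusters = [{0, 1, 2}, {10, 11, 12}]
sample = [0, 2, 10, 12, 0]             # stands in for Z_1, ..., Z_m
dist = lambda a, b: abs(a - b)
own = empirical_avg_dist(1, sample, clusters[0], dist)
other = empirical_avg_dist(1, sample, clusters[1], dist)
```

In the lemma the points $Z_1, \ldots, Z_m$ are random, and the Hoeffding bounds control how far such empirical averages can stray from $\Delta(x, A)$.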

Proof (of Proposition 2.8) It is very suggestive how to select the exceptional set in the 
(a— e, 7— 0(e), e)-clustering that we seek. Namely, let N = Uj (C, \ (77(A)). As needed, 
P(N) < e, since <i(C 7 (./l), C) < e. To prove our claim, note that Vz, P{C]) > a — e 
since d(C, CJ(^4.)) < e. Consider some x G A \ N and the unique index i for which 
.x G C*7(^4) flCj. If j 7^ z, we need to show that 

A(x,C7(^))>( 7 -0(e))A(x,C7(^)) 

As in the proof of lemma 2.5, we have 

A(x,C](A)) > (l--)A(x,a)-- max d(x,y) 
3 \ aJ ayec 3 \c](A) 



a 7(7 — 1 
=: (1 - a x ■ e) ■ A(x, C s ) 

Similarly, again as in the proof of lemma 2.5, we have 



A(x, CA > (l - -) A(x, C?(A)) - - max d(x, y) (4) 
- a/ a y &c'<{A)\Ci 



Now, for y G (77(A), we have 



d{x,y) < A u {x,A i )+A u {y,A i ) 

< -A u (x,A j ) + -A u (y,A j ) 

2 1 

< -A u (x,A J ) + -d(x,y) 

7 7 
Now, since Aj C Cj, by lemma 2.2, Au(x,Aj) < 7~^ A(x, Cj) and we have, 

7 7 2 + 1 2 

dx,y <-L-.-LJ^ .-A(x,Cj) 
7-1 717-1) 7 

So, by equation (4) we have, 

A(x, d) > (1 - a 2 ■ e) • A(z, C/^)) - a 3 ■ e ■ A(x, C,-) (5) 

For some positive constants a 2 , a 3 which depend only of 7 and a. Now by equations 
(3) and (5) we conclude that 

A(x,C](A)) > (l-(a 1 +ja 3 )-e)-A(x,C j )+ja 3 -e-A(x,C j ) 

> (1 - (ai + 703) • e) • 7 • A(x, Cj) + 703 • e • A(x, Q) 

> (1 - {a x + 7 a 3 ) • e)(l - a 2 • e) 7 • A{x, C]{A)) 
= (j-0(e))-A(x,CnA)) 

Proof (of Theorem 3.4). Both claims are proved by the same reduction from
3-DIMENSIONAL-MATCHING (e.g., [7], p. 221). The input to this problem is a subset
$M \subseteq Y \times Z \times W$, where $Y, Z, W$ are three disjoint $q$-element sets. A
three-dimensional matching (3DM) is a $q$-element subset $M' \subseteq M$ that covers all
elements in $Y \cup Z \cup W$. The problem is to decide whether a 3DM exists.
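
To make the source problem of the reduction concrete, here is a brute-force decision procedure for 3DM. It is a sketch for tiny instances only (the problem is NP-hard, and this search is exponential); the function name `find_3dm` and the encoding of $Y$, $Z$, $W$ as integer labels are our assumptions.

```python
from itertools import combinations

def find_3dm(M, q):
    """Return q triples of M that cover all 3q elements of Y, Z and W, or None."""
    for cand in combinations(M, q):
        covered = set()
        for (y, z, w) in cand:
            # Tag each element with its set so that Y, Z, W stay disjoint.
            covered.update((("Y", y), ("Z", z), ("W", w)))
        if len(covered) == 3 * q:   # q triples can cover 3q elements only if disjoint
            return list(cand)
    return None

# Y = Z = W = {0, 1} as label sets, so q = 2.
M_yes = [(0, 0, 0), (0, 1, 1), (1, 1, 1), (1, 0, 1)]
M_no = [(0, 0, 0), (0, 1, 1), (1, 0, 1)]
```

Here `M_yes` contains the matching $\{(0,0,0), (1,1,1)\}$, while in `M_no` every pair of triples shares an element.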

We associate with this instance of the problem a graph on vertex set $Y \cup Z \cup W$, whose
edge set is the union of all triangles $\{y, z, w\}$ over $(y, z, w) \in M$. It is not hard to see
that 3DM remains NP-hard under the restriction that this graph is connected.

Here is our reduction. Given an instance $M \subseteq Y \times Z \times W$ of 3DM, we construct
a graph $G_M = (V_M, E_M)$ as follows: associated with every $m = (y, z, w) \in M$ is
the gadget below. We consider the clustering problem on $G_M$ with its natural graph
metric.




We say that a triangle T in a graph is isolated if every vertex outside it has at most 
one neighbor in T. The above gadget is useful for the reduction since it's easy to verify 
that: 
Claim 1 The graph G can be partitioned into isolated triangles iff M has a 3DM.

Proof (sketch). If M has a 3DM, we can construct a partition of V into isolated
triangles by taking the triangles

$$\{y, m_1, m_2\},\ \{z, m_6, m_8\},\ \{w, m_7, m_9\},\ \{m_3, m_4, m_5\} \tag{6}$$

for every m in the 3DM and the triangles

$$\{m_1, m_3, m_6\},\ \{m_2, m_4, m_7\},\ \{m_5, m_8, m_9\} \tag{7}$$

for m outside it. On the other hand, consider any partition of G into isolated triangles.
Its restriction to every gadget must coincide with one of the above two choices, so that
the corresponding 3DM is readily apparent. □

Both NP-hardness claims in Theorem 3.4 follow from the above discussion and the
following claim.

Claim 2 Let $G = (V, E)$ be a connected graph in which all vertex degrees are $\ge 2$. For
every partition of the vertex set $V = \bigcup_{i=1}^{k} C_i$, the following are equivalent:

1. Each $C_i$ induces an isolated triangle.

2. Each $C_i$ is a $(\frac{3}{|V|}, 2.5)$-cluster.

3. The partition $C_1, \ldots, C_k$ is a $(\frac{3}{|V|}, 2.5)$-clustering.

Proof. The implications 1 ⇒ 2 and 1 ⇒ 3 are easily verified. We turn to prove
3 ⇒ 1. Let $i \in [k]$. We need to show that $C_i$ is an isolated triangle. Clearly,
$|C_i| \ge 3$ by definition of a $(\frac{3}{|V|}, 2.5)$-clustering. But G is connected, so there are two
neighbors $x, y$ with $x \in C_i$, $y \notin C_i$. By Proposition 2.2 we have

$$1 = d(x, y) \ge (2.5 - 1)\,\Delta(x, C_i) \ge 1.5 \cdot \frac{|C_i| - 1}{|C_i|},$$

so that $|C_i| = 3$. Consider now $x, y \in C_i$ which are nonadjacent. Since $x$ has degree
$\ge 2$, it has a neighbor $z \notin C_i$. Using Proposition 2.2 we arrive at the following contradiction:
$1 = d(x, z) \ge (2.5 - 1)\,\Delta(x, C_i) \ge 1.5 \cdot \frac{0 + 1 + 2}{3} = 1.5$. We already know that each $C_i$ is a
triangle, but why is it isolated? If $z \in C_j$, $j \ne i$, has at least two neighbors in $C_i$, then

$$\Delta(z, C_i) \le \frac{1 + 1 + 2}{3} = \frac{4}{3} < \frac{5}{3} = 2.5 \cdot \frac{0 + 1 + 1}{3} = 2.5 \cdot \Delta(z, C_j),$$

a contradiction.

The proof of 2 ⇒ 1 is similar. Let $C_i$ be a cluster in the partition. Using the same
argument as before, where the fact that for all $x \in C_i$, $y \notin C_i$,
$d(x, y) \ge \Delta(y, C_i) - \Delta(x, C_i) \ge (2.5 - 1)\,\Delta(x, C_i)$ replaces Proposition 2.2, we deduce that $C_i$ induces a
triangle. To show that $C_i$ is isolated, suppose that there exists a vertex $z \notin C_i$ with
$\ge 2$ neighbors in $C_i$. Let $x \in C_i$ be an arbitrary vertex. To obtain a contradiction, we
note that

$$\Delta(z, C_i) \le \frac{1 + 1 + 2}{3} = \frac{4}{3} < \frac{5}{3} = 2.5 \cdot \frac{0 + 1 + 1}{3} = 2.5 \cdot \Delta(x, C_i).$$

□
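
The equivalence in Claim 2 can be checked mechanically on small graphs. The sketch below rests on assumptions that are ours rather than quoted from the text: the measure is uniform on $V$; a partition is an $(\alpha, \gamma)$-clustering when every part has measure at least $\alpha$ and $\Delta(x, C_j) \ge \gamma \Delta(x, C_i)$ for $x \in C_i$, $j \ne i$; and a set $C$ is an $(\alpha, \gamma)$-cluster when it has measure at least $\alpha$ and $\Delta(y, C) \ge \gamma \Delta(x, C)$ for $x \in C$, $y \notin C$. Exact rational arithmetic is used because the inequalities hold with equality in the example.

```python
from collections import deque
from fractions import Fraction

def graph_metric(adj):
    """All-pairs shortest-path distances of a connected unweighted graph (BFS)."""
    dist = {}
    for s in adj:
        seen, queue = {s: 0}, deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        dist[s] = seen
    return dist

def avg_dist(x, C, dist):
    """Delta(x, C) under the uniform measure; x itself counts when x is in C."""
    return Fraction(sum(dist[x][y] for y in C), len(C))

def is_isolated_triangle(T, adj):
    if len(T) != 3 or any(not (T - {u} <= adj[u]) for u in T):
        return False
    return all(len(adj[v] & T) <= 1 for v in adj if v not in T)

def is_cluster(C, adj, dist, alpha, gamma):
    if Fraction(len(C), len(adj)) < alpha:
        return False
    return all(avg_dist(y, C, dist) >= gamma * avg_dist(x, C, dist)
               for x in C for y in adj if y not in C)

def is_clustering(parts, adj, dist, alpha, gamma):
    if any(Fraction(len(C), len(adj)) < alpha for C in parts):
        return False
    return all(avg_dist(x, D, dist) >= gamma * avg_dist(x, C, dist)
               for C in parts for x in C for D in parts if D is not C)

# Prism graph: triangles {0,1,2} and {3,4,5} joined by the edges 0-3, 1-4, 2-5.
adj = {0: {1, 2, 3}, 1: {0, 2, 4}, 2: {0, 1, 5},
       3: {4, 5, 0}, 4: {3, 5, 1}, 5: {3, 4, 2}}
dist = graph_metric(adj)
alpha, gamma = Fraction(3, len(adj)), Fraction(5, 2)
good = [{0, 1, 2}, {3, 4, 5}]   # both parts are isolated triangles
bad = [{0, 1, 3}, {2, 4, 5}]    # neither part induces a triangle
```

On this graph all three conditions hold for `good` and all three fail for `bad`, as the claim predicts.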

Note A.1 Theorem 3.4 is tight in the following sense: As the proof shows, the above
problems are hard even for graph metrics. On the other hand, given a graph $G = (V, E)$,
the following polynomial time algorithms find (i) a partition into $(\alpha, 2.5 + \varepsilon)$-clusters,
and (ii) an $(\alpha, 2.5 + \varepsilon)$-clustering (provided, of course, that one exists).

1. If a > then, as in the proof of theorem 3.4, one shows that a partition into 
$(\alpha, 2.5 + \varepsilon)$-clusters / $(\alpha, 2.5 + \varepsilon)$-clustering is equivalent to a perfect matching,
no edge of which is contained in a triangle. Such a matching can be found by first eliminating
every edge that belongs to a triangle and then running an arbitrary matching
algorithm.

2. If a < then clearly there is no partition into (a, 2.5 + e) -clusters / (a,2.5 + e)- 
clustering. If a = j^, the singletons constitute a partition ofV into (a, 2.5 + e)- 
clusters and a (a, 2.5 + e)- clustering. 
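
The two-step procedure in item 1 can be sketched as follows. The edge-pruning step is as described in the note; for the matching step we substitute a simple backtracking search, which is exponential in the worst case and only stands in for the polynomial-time matching algorithm the note has in mind. The function names are hypothetical.

```python
def remove_triangle_edges(adj):
    """Return a copy of the graph without the edges that lie in some triangle."""
    pruned = {v: set() for v in adj}
    for u in adj:
        for v in adj[u]:
            if not (adj[u] & adj[v]):   # no common neighbour: uv is in no triangle
                pruned[u].add(v)
    return pruned

def perfect_matching(adj):
    """Backtracking search for a perfect matching; returns a list of edges or None."""
    def solve(free):
        if not free:
            return []
        u = min(free)                   # the smallest unmatched vertex must be matched
        for v in sorted(adj[u] & free):
            rest = solve(free - {u, v})
            if rest is not None:
                return [(u, v)] + rest
        return None
    return solve(frozenset(adj))

def triangle_free_perfect_matching(adj):
    return perfect_matching(remove_triangle_edges(adj))

# Prism graph: only the three edges between the two triangles survive pruning.
prism = {0: {1, 2, 3}, 1: {0, 2, 4}, 2: {0, 1, 5},
         3: {4, 5, 0}, 4: {3, 5, 1}, 5: {3, 4, 2}}
# In K4 every edge lies in a triangle, so nothing survives.
k4 = {v: {u for u in range(4) if u != v} for v in range(4)}
```

For graphs of realistic size the backtracking step should be replaced by a proper general-graph matching routine.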

Note A.2 As in Note 2.7, by replacing each vertex with many points at distance $\varepsilon$
from each other, the above reduction applies as well with the definition
$\Delta(x, A) = \mathbb{E}[d(x, y) \mid y \in A \setminus \{x\}]$.